In this codebook, we walk you through the data cleaning, data augmentation, and predictive modeling process through which we selected Walgreens as the target brand for downtown Rochester.
The codebook follows the order of our analysis: business location data cleaning, reverse geocoding, ACS data extraction through the Census API, data combination, multi-label brand modeling, and the final Walgreens prediction for downtown Rochester.
# Install the geopy package for reverse geocoding
!pip install geopy
# Load Packages
import numpy as np
import pandas as pd
# Load Business Location Data
sm = pd.read_csv(".../POIN_MASTER_010319.csv")
# pre-selected target list of chain brands from the business location data;
# the following analysis focuses only on these stores.
chain_store_list = ['GEOCVS','GEODD','GEODICK','GEOHDPT','GEOHGOOD','GEOJCP',
'GEOKOHL','GEOKRGR','GEOLOWE','GEOMACYS','GEOMASS','GEOODEPT',
'GEOROSS','GEOSAFEW','GEOSAV','GEOSEAR','GEOTJ','GEOTRGT',
'GEOWALG','GEOWMT']
# remove businesses outside the target list from the original data frame.
sm = sm[sm['gitext'].isin(chain_store_list)]
sm.info()
In this data frame, 7,392 rows have NaN in the zip code column. Since zip code is our primary level of granularity, we use reverse geocoding to retrieve the missing values.
Because the free reverse-geocoding service is slow, we do not re-retrieve zip codes that are already available in the data frame. We only reverse-geocode the rows whose zip code is missing.
# rows that originally had no zip code; their zip codes were retrieved through
# reverse geocoding (pre-computed once and saved to a temp file)
sm_isna = pd.read_csv(".../temp_sm_isna.csv")
# rows that already have zip code information
sm_isnotna = sm[sm['zip'].isna() == False]
sm_isna.info()
Reverse geocoding retrieves address information, including the zip code, from spatial coordinates. We use the geopy.geocoders package.
# Import packages
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
# note: newer versions of geopy require a custom user_agent, e.g. Nominatim(user_agent="codebook", timeout=3)
geolocator = Nominatim(timeout=3)
# limit requests to at most one per second, per the free Nominatim usage policy
geocode = RateLimiter(geolocator.reverse, min_delay_seconds=1)
# df holds the rows that still need a zip code (the missing-zip subset before it was saved to temp_sm_isna.csv)
for i in range(len(df)):
    try:
        df.loc[i, 'zipcode'] = geocode(str(df.iloc[i]['latitude']) + ',' + str(df.iloc[i]['longitude'])).raw['address']['postcode']
        print(str(i) + ' ' + str(df.iloc[i]['zipcode']))
    except KeyError:
        # some locations return no postcode; skip them here and hardcode them later
        print('no postcode found for row ' + str(i))
        continue
After applying the method above to the rows with missing zip codes, and hardcoding the few zip codes that could not be retrieved, we obtain the full dataset.
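The manual patch looked roughly like the sketch below; the row indices and zip codes here are placeholders, not the actual values we filled in:
# hypothetical illustration of hardcoding the remaining zip codes by hand
# (indices and zip codes below are placeholders, not our actual values)
manual_fixes = {123: '14604', 456: '14608'}
for idx, z in manual_fixes.items():
    df.loc[idx, 'zipcode'] = z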
sm_new = pd.concat([sm_isna,sm_isnotna],axis=0)
sm_new.drop(columns=['Unnamed: 0'],inplace=True)
# sm_new.to_csv("reverse_geocodone.csv")
Although our primary focus is zip-code-level data, we also need city and state information for our target stores in order to conduct the peer-city analysis introduced in our pitch deck.
To retrieve city and state we use a different reverse-geocoding package. The Python code is shown below:
!pip install reverse_geocoder
import reverse_geocoder as rg
import pprint  # pprint is part of the standard library; no separate install is needed
import time
df['cor'] = list(zip(df.latitude, df.longitude)) # pair latitude and longitude into coordinate tuples
def find_city_name(cor):
    city_name = rg.search(cor)[0]['name']
    return city_name

def find_state(cor):
    state_name = rg.search(cor)[0]['admin1']
    return state_name
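Applying these helpers row by row gives us the city and state for every store; a minimal sketch (the column names are illustrative, and df is the target-store data frame built above):
# look up city and state for each store from its coordinate tuple
df['city_name'] = df['cor'].apply(find_city_name)
df['state_name'] = df['cor'].apply(find_state)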
Our final dataset for peer city analysis can be found in our hand-in file: peercity.csv
In this section, we further process the data. Some zip codes come in irregular formats, which we either convert to a uniform five-digit format or drop. The irregular zip codes fall into the patterns handled below:
# sm_new = pd.read_csv(".../reverse_geocodone.csv")
# sm_new.drop(columns=['Unnamed: 0'],inplace=True)
f = sm_new[['gitext', 'poiname', 'addr1', 'city', 'state','zip','latitude','longitude']]
# remove Canadian postal codes (they contain letters)
f = f[~f.zip.str.match('[a-zA-Z]')]
# rows where zip looks like xxxxx-xxxx: keep the first five digits
a = f[f.zip.str.match(r'^\d{5}-')].copy()
a['zip'] = a['zip'].apply(lambda x: x[:5])
# rows where zip looks like xxxxx:xxxxx: keep the first five digits
b = f[f.zip.str.match(r'^\d{5}:\d{5}')].copy()
b['zip'] = b['zip'].apply(lambda x: x[:5])
# rows where zip has fewer than five digits: pad with leading zeros
c = f[f['zip'].apply(lambda x: len(x) < 5)].copy()
c['zip'] = c['zip'].apply(lambda x: '{0:0>5}'.format(x))
# rows where zip looks like xxxxx;...: keep the first five digits
d = f[f.zip.str.match(r'^\d{5};')].copy()
d['zip'] = d['zip'].apply(lambda x: x[:5])
# rows where the zip is already in the desired five-digit format
e = f[f.zip.str.match(r'^\d{5}$')]
# concat them together to get the final business location dataset
final = pd.concat([a,b,c,d,e],axis=0)
# final.to_csv("full_zip.csv")
final.info()
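As an optional sanity check, every remaining zip code should now be in the standard five-digit form:
# verify that the cleaned zip column contains only five-digit strings
assert final['zip'].str.match(r'^\d{5}$').all()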
The American Community Survey (ACS) is an ongoing survey by the U.S. Census Bureau. It regularly gathers information previously contained only in the long form of the decennial census, such as ancestry, citizenship, educational attainment, income, language proficiency, migration, disability, employment, and housing characteristics. These data are used by many public-sector, private-sector, and not-for-profit stakeholders to allocate funding, track shifting demographics, plan for emergencies, and learn about local communities. Sent to approximately 295,000 addresses monthly (or 3.5 million per year), it is the largest household survey that the Census Bureau administers.
In this section, we show how we use the Census Bureau's Application Programming Interface (API) to extract our pre-selected ACS attributes for the zip codes generated through the reverse-geocoding process.
The main Python code is shown below:
import requests
import json
# apiKey (your Census API key), filed (the requested attributes), and a (a comma-separated
# batch of zip codes) are defined further below; this shows the shape of a single API call
url = f'https://api.census.gov/data/2018/acs/acs5?key={apiKey}&get={filed}&for=zip%20code%20tabulation%20area:{a}'
response = requests.get(url)
# load the response into JSON, ignoring the first element, which is just the field labels
formattedResponse = json.loads(response.text)[1:]
# reverse each record so that the zipcode comes first
formattedResponse = [item[::-1] for item in formattedResponse]
Notes:
You can get your apiKey through https://www.census.gov/developers/ by clicking the Request a KEY button on the left.
In the url, filed holds the pre-selected attributes you want to retrieve; the full list of ACS attributes can be found at https://api.census.gov/data/2018/acs/acs5/variables.html.
You can also extract data at a different granularity than zip code; see the county-level example after these notes.
The maximum number of zip codes you can retrieve from one API call is 1,000, so in the code below we write a for loop to retrieve data for the 10,200 zip codes in batches.
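For instance, a county-level request against the same 2018 ACS 5-year endpoint looks like the sketch below (total population for every county in New York State, FIPS code 36; the attribute and geography choices are illustrative):
# example: total population (B01003_001E) for every county in New York State (FIPS 36)
county_url = (f'https://api.census.gov/data/2018/acs/acs5?key={apiKey}'
              f'&get=NAME,B01003_001E&for=county:*&in=state:36')
county_data = json.loads(requests.get(county_url).text)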
# zipz holds the zip codes generated from the reverse-geocoding process;
# write them to a file, then read them back in
fl = open(".../zip.txt", 'w')
for i in zipz:
    fl.write(i + '\n')
fl.close()
zips = open('.../zip.txt', 'r').readlines()
zips = [z.replace('\n', '') for z in zips]
# print(len(zips))
# load the pre-selected attributes.
filed = open('.../field of interest.txt', 'r').readlines()
filed = [h.replace('\n', '') for h in filed]
print(len(filed))
filed = ','.join(filed)
df = pd.DataFrame(columns=["zipcode","B02001_001E","B02001_002E","B02001_003E","B02001_004E","B02001_005E",
"B02001_006E","B02001_007E","B02001_008E","B01003_001E","B08301_002E","B08301_003E",
"B08301_004E","B08301_005E","B08301_006E","B08301_007E","B08301_008E","B08301_009E",
"B08301_010E","B08301_011E","B08301_012E","B08301_013E","B08301_014E","B08301_015E",
"B08301_016E","B08301_017E","B08301_018E","B08301_019E","B08301_020E","B08301_021E",
"B08014_002E","B08014_003E","B08014_004E","B08014_005E","B08014_006E","B08014_007E"])
for i in range(102):
    # take the next batch of 100 zip codes and join them into one API call
    a = zips[i*100:(1+i)*100]
    a = ','.join(a)
    url = f'https://api.census.gov/data/2018/acs/acs5?key={apiKey}&get={filed}&for=zip%20code%20tabulation%20area:{a}'
    response = requests.get(url)
    # load the response into JSON, ignoring the first element, which is just the field labels
    formattedResponse = json.loads(response.text)[1:]
    # reverse each record so that the zipcode comes first
    formattedResponse = [item[::-1] for item in formattedResponse]
    # store the batch in a dataframe with the same columns as df and append it
    zippp = pd.DataFrame(columns=df.columns, data=formattedResponse)
    df = pd.concat([df, zippp], axis=0)
# drop zip codes for which the API returned no data
df = df[df['B02001_001E'].isna() == False]
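One practical note: the Census API returns every value as a string, so the attribute columns need to be converted to numeric types before modeling. A minimal sketch of that step (the exact point in our pipeline where we converted may differ):
# convert the ACS attribute columns from strings to numbers; keep zipcode as a string
acs_cols = [c for c in df.columns if c != 'zipcode']
df[acs_cols] = df[acs_cols].apply(pd.to_numeric, errors='coerce')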
In this section, we combine the ACS data, Mosaic data, and business location data into a single table for our modeling process.
ma = pd.read_csv(".../mosaic_byZip.csv")
# format the mosaic data's zipcode column into standard xxxxx
ma['zip'] = ma['zip'].apply(lambda x: '{0:0>5}'.format(x))
cs_and_ma = df.merge(ma,how='left',left_on='zipcode',right_on='zip')
cs_and_ma.drop(columns=['zip'],inplace=True)
# create a wide table for the multi-label machine learning step
# zip_final holds the set of zip codes covered by the retrieved census data
final = final[final['zip'].isin(zip_final)]
final = final[['gitext', 'zip']]
final = final.astype(str)
final['existence'] = [True] * len(final)
# one row per zip code, one column per brand, flagged where the brand has a store in that zip
final_wide = final.pivot_table(index='zip', columns='gitext', values='existence')
final_wide.fillna(False, inplace=True)
# Combine the business location data with the census + mosaic dataset
final = final_wide.reset_index()
final = final.merge(cs_and_ma,how='left',left_on='zip',right_on='zipcode')
final.drop(columns=['zipcode'],inplace=True)
# final.to_csv("fulldata.csv")
In this section, we search for an optimal machine learning model to predict the probability that each of our 20 targeted brands would open in the downtown Rochester zip codes: 14604, 14608, and 14614.
raw = final
# drop rows that are missing either ACS or Mosaic information
raw.dropna(axis=0, inplace=True)  # 9,855 zip codes are left
# create our multi-label dependent variables
label = raw.loc[:, 'GEOCVS':'GEOWMT']
y = label*1
# y.shape
# put our independent variables into X
X = raw.loc[:, 'B02001_001E':'political_affiliation_p111975_n']
X.shape
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_stan = scaler.fit_transform(X)
In this step, we tried random forest multi-label classification and several deep neural networks to find the optimal predictive model.
# Train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_stan, y, test_size=0.2, random_state=101)
from sklearn.metrics import accuracy_score,roc_curve, auc, roc_auc_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn import metrics
from sklearn.metrics import classification_report
n_folds = 10

def get_CVacc(model):
    """
    Return the cross-validated accuracy score
    """
    # set KFold to shuffle the data before the split
    kf = KFold(n_folds, shuffle=True, random_state=42)
    # get the accuracy score for each fold
    scores = cross_val_score(model, X_stan, y, scoring="accuracy", cv=kf)
    return scores.mean()

def get_acc(model):
    """
    Fit the model, print a classification report, and return the accuracy score on the test set
    """
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    acc = accuracy_score(y_test, predictions)
    print(classification_report(y_test, predictions))
    return acc
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=500)
print("accuracy_rfc: ", get_acc(rfc))
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation,Dropout,BatchNormalization
from tensorflow.keras.constraints import max_norm
# Model 1:
model = Sequential()
# input layer
model.add(Dense(246, activation='sigmoid'))
model.add(Dropout(0.5))
# hidden layer
model.add(Dense(64, activation='sigmoid'))
model.add(Dropout(0.5))
# hidden layer
model.add(Dense(128, activation='sigmoid'))
model.add(Dropout(0.5))
# hidden layer
model.add(Dense(32, activation='sigmoid'))
model.add(Dropout(0.5))
# output layer
model.add(Dense(units=20,activation='sigmoid'))
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam',metrics=['accuracy'])
# Model 2:
model2 = Sequential()
model2.add(Dense(246, input_dim= 246, activation='sigmoid'))
model2.add(Dense(128, activation='sigmoid'))
model2.add(Dense(64, activation='sigmoid'))
model2.add(Dropout(0.5))
model2.add(Dense(64, activation='sigmoid'))
model2.add(Dense(32, activation='sigmoid'))
model2.add(Dropout(0.5))
model2.add(BatchNormalization())
model2.add(Dense(20, activation='softmax'))  # note: softmax makes the 20 output probabilities sum to 1
# Compile model
model2.compile(loss='binary_crossentropy', optimizer='adam',metrics=['accuracy'])
# Model 3:
model3 = Sequential()
model3.add(Dense(246, kernel_initializer="uniform", activation='sigmoid'))
model3.add(Dense(64, activation='sigmoid'))
model3.add(Dense(32, activation='sigmoid'))
model3.add(Dense(20, activation='sigmoid'))
model3.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
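The full comparison among the three networks is not reproduced here; a minimal sketch of how they can be compared on validation accuracy (epoch and batch-size values are illustrative):
# fit each candidate network on the training split and compare the final validation accuracy
# ('val_accuracy' is the TensorFlow 2.x key; older versions use 'val_acc')
for name, m in [('model', model), ('model2', model2), ('model3', model3)]:
    history = m.fit(X_train, y_train, epochs=20, batch_size=128,
                    validation_data=(X_test, y_test), verbose=0)
    print(name, 'validation accuracy:', history.history['val_accuracy'][-1])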
# After comparison we use the first model:
model.fit(x=X_train,
y=y_train,
epochs=20,
batch_size=128,
validation_data=(X_test, y_test),
)
# downtown Rochester zip codes (14608, 14604, 14614) with their ACS + Mosaic attributes
test = pd.read_csv('.../roc.csv')
testset = test.loc[:, 'B02001_001E':'political_affiliation_p111975_n']
# apply the same standardization that was used for the training data
testset = scaler.transform(testset)
# refit the chosen network on the full standardized dataset before prediction
model.fit(X_stan, y)
predictions_nn = model.predict(testset)
# one column per downtown zip code, one row per brand
output_nn = pd.DataFrame({
    '14608': predictions_nn[0],
    '14604': predictions_nn[1],
    '14614': predictions_nn[2]}, index=y.columns)
output_nn['mean'] = np.mean(output_nn, axis=1)
output_nn.sort_values(by='mean', ascending=False, inplace=True)
output_nn
In this section, combining the model output above with our peer-city analysis, we choose Walgreens as our final target.
We then use binary classification models to confirm that opening a Walgreens in downtown Rochester is promising, and to offer some insights for the site-selection process introduced in our pitch deck.
# binary label: 1 if the zip code has a Walgreens (GEOWALG), 0 otherwise
label2 = raw.loc[:, 'GEOWALG']
y2 = label2*1
x_train, x_test, y2_train, y2_test = train_test_split(X_stan, y2, test_size=0.2, random_state=101)
import matplotlib.pyplot as plt

def plotROC(model):
    """
    Fit the model, plot the ROC curve, print the classification report and the
    optimal threshold, and return the accuracy score
    """
    model.fit(x_train, y2_train)
    predictions = model.predict(x_test)
    probs = model.predict_proba(x_test)
    preds = probs[:, 1]
    fpr, tpr, threshold = roc_curve(y2_test, preds)
    roc_auc = auc(fpr, tpr)
    # plot ROC AUC
    plt.title('Receiver Operating Characteristic')
    plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % roc_auc)
    plt.legend(loc='lower right')
    plt.plot([0, 1], [0, 1], 'r--')
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()
    # report
    acc = accuracy_score(y2_test, predictions)
    print(classification_report(y2_test, predictions))
    # find the threshold that maximizes tpr - fpr (Youden's J statistic)
    rocDf = pd.DataFrame({'fpr': fpr, 'tpr': tpr, 'threshold': threshold})
    rocDf['tpr - fpr'] = rocDf.tpr - rocDf.fpr
    optimalThreshold = rocDf.threshold[rocDf['tpr - fpr'].idxmax()]
    print('optimal threshold:', optimalThreshold)
    return acc
rfc2 = RandomForestClassifier(n_estimators=500)
print("accuracy_rfc: ", plotROC(rfc2))
from lightgbm import LGBMClassifier
lgb = LGBMClassifier(objective='binary',
learning_rate=0.049,
n_estimators=1500,
num_leaves=8,
min_data_in_leaf=4,
max_depth=3,
max_bin=41,
bagging_fraction=0.845,
bagging_freq=5,
feature_fraction=0.24,
feature_fraction_seed=9,
bagging_seed=9,
min_sum_hessian_in_leaf=11)
print("accuracy_lgb: ", plotROC(lgb))
Since LightGBM generates a slightly better result, we choose it as our final Walgreens-prediction model.
# refit LightGBM on the full standardized dataset
lgb.fit(X_stan, y2)
# predicted probability of a Walgreens for each downtown zip code
predictions = lgb.predict_proba(testset)[:, 1]
predictions
output = pd.DataFrame({'zipcode': test.zipcode,
'probability': predictions})
output
It turns out that all three downtown zip codes have very high predicted probabilities for opening a Walgreens, which supports our earlier brand-selection model.
Thank you for your patience in reading through our codebook. We sincerely hope that our explanation of the data analysis process is clear, and that this codebook serves as a strong appendix to our final pitch deck presentation.